import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
Problem Statement: Sneakers are a relatively cheap item, given that these shops are selling sneakers, something seems to be wrong in our analysis. Our objective is to figure out what could be skewing the AOV and we can fix it.
data = pd.read_csv('sneaker_data.csv')
data.head()
data.shape
print(data['created_at'].min())
print(data['created_at'].max())
msno.matrix(data)
data.isnull().sum()
The missing value matrix allows us to visually see if the missing values in the dataset follow any pattern.
There are no missing values in this dataset.
# counts the number of unique shops in our data
print("Numbers of unique shops: "+ str(len(data.shop_id.unique())))
# calculate summary statistics of order_amount column
data.order_amount.describe()
Therefore, we can verify that:
order_amount column raises red flags¶This tells us us that the order amounts are spread out over an extremely wide range since the SD is 41,282.54 and although, 50% of the orders are below 284 dollars (and 75% orders are below 390.00)
Based on these facts, we have evidence which indicates our data may contain outliers..
fig = px.scatter(data, x="created_at",
y="order_amount",
color="shop_id",log_x=False,
size_max=60, template="plotly_dark",
hover_data=["payment_method", "total_items"],
labels={
"created_at": "Time",
"order_amount": "Order Amount (Dollars)",
},
title="Distribution of Sales" )
fig.show()
Sadly after uploading this notebook to gitbub, I found out that interactive plotly graphs do not render on github. Therefore, I'm up loading a png version of the graph in this notebook. Checkout my website https://hananather.github.io/ where I'm uploading the interactive version. I promise the interactive version is a lot better and conveys much more information.
order_amount: shop_id =42 and shop_id = 78order_amount, compared to the rest of the orders# shop id 42
df_shop_42 = data[data['shop_id'] == 42]
fig = px.scatter(df_shop_42, x="created_at",
y="order_amount",
color="shop_id",log_x=False,
size_max=60, template="plotly_dark",
hover_data=["payment_method", "total_items"],
labels={
"created_at": "Time",
"order_amount": "Order Amount (Dollars)",
},
title="Shop ID: 42 Distribution of Sales" )
fig.show()
# shop id 78
df_shop_78 = data[data['shop_id'] == 78]
fig = px.scatter(df_shop_78, x="created_at",
y="order_amount",
color="shop_id",log_x=False,
size_max=60, template="plotly_dark",
hover_data=["payment_method", "total_items"],
labels={
"created_at": "Time",
"order_amount": "Order Amount (Dollars)",
},
title="Shop ID: 78 Distribution of Sales" )
fig.show()
They two plots above give us two important missing details of the story:
Shop ID: 42 seems to have sold shoes in order sizes ranging from 1-3, however, it also regularly sold shoes in bulk orders containing 2000 shoes, which bring up the order values to 704,000 dollars
Shop ID: 78 just sells very expensive sneakers (25,725 dollars/pair)
Therefore, we can rule out the possibility that these extreme values were generated by an imputation error or some sort of mistake.
It would be useful to see what proportion of the revenue in order_value is generated by orders where 2000 shoes were sold. Furthermore, it would also be helpful to know what the distribution of total_orders is, i.e., in what percentage orders one sneaker is purchased versus in percent of orders 2000 sneakers are ordered.
It would be useful to see order_value categorically separated into distinct bins based on the total items in the order. For example, it would be helpful to know what proportion of the revenue in order_value is generated by orders where 2000 shoes were sold. Furthermore, it would also be helpful to know what the distribution of total_orders.
groupbyTotal = data.groupby('total_items')['order_amount'].sum().rename_axis('total_items').reset_index(name='total')
total_items_count = data.total_items.value_counts().rename_axis('total_items').reset_index(name='count')
groupbyTotal
total_items_count
We need reformat the total_items_count dataframe so the labels in both dataframes above are aligned otherwise the plot labels will not match
# aligning the dataframe labels
total_items_count = {'total_items':
[1,2,3,4,5,6,8,2000],
'count':
[1830,1832,941,293,77,9,1,17],
}
total_items_count = pd.DataFrame(total_items_count , columns = ['total_items', 'count'])
labels = [1,2,3,4,5,6,8,2000]
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=groupbyTotal['total'], name="Total Order Amounts"),
1, 1)
fig.add_trace(go.Pie(labels=labels, values=total_items_count ['count'], name="Count"),
1, 2)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_layout(
title_text="Distribution of Order Amount and Total Items from 2017-03-01 to 2017-03-30",
# Add annotations in the center of the donut pies.
annotations=[dict(text='Total Order Amounts', x=0.145, y=0.5, font_size=12, showarrow=False),
dict(text='Number of Items in Order', x=0.95, y=0, font_size=16, showarrow=False)],template="plotly_dark")
fig.show()
Again! Sadly the that interactive plotly graphs do not render on github. Therefore, I'm up loading a png version of the graph in this notebook. Checkout my website https://hananather.github.io/ where I'm uploading the interactive version. I promise the interactive version is better.
In the above figure, the first plot indicates the proportion of total order amount when we group by the total_items. So we can see that the orders from store id: 42 where 2000 total shoes were purchased, account for 76.1% ($11,968,000) of the total order amount in our dataset.
However, in the second plot, we can see that orders where 2000 shoes were purchased only make up 0.34%(17/5000) of the total order in our dataset. Therefore, these relatively few observations generated by store ID: 42 massively skew the distribution order_amount.
Before we jump headfirst into calculating a better metric, it might be a good idea to brainstorm advantages and disadvantages of AOV.
Since AOV is a measure of central tendency, its only a good metric if the underlying distribution is not skewed. Since total items in the orders can range anywhere from 1 to 2000, grouping them together is not an effective strategy since it results in an AOV of 3145.128000 and standard deviation of 41282.54.
Furthermore, AOV also varies widely depending on different types of sneaker stores. For instance a highend sneaker stores selling premium products (such as Shop ID: 78) will tend to have a high AOV. On the other hand, a discount store selling low priced sneaker will tend to have a low AOV.
In addition to that, the issue with using AOV is that it does not take into to account the velocity of the orders. Having a high AOV might not help you as much if the bulk of your customers are one time buyers only.
There are other long term factors which can also potentially skew the AOV if data spans a larger time frame. If most of the shoe sales occur over the holidays, the overall AOV may be much higher than what the merchant experiences in off-season times.
A good metric should measure and report what is important to either the merchants, or Shopify (to improve the merachants experience selling).
So the question we should ask is: what metrics might be Shopify interested in? What metrics is the merchant interested?
No single number (metric or statistic) can capture the complexity of a distribution entirely. In many cases, looking at a single metric in isolation can be misleading. It may do more harm than good. In my experience, viewing a metric in a broader context with reference to other metrics is more insightful.
There are two main problems which skew the distribution of order_amount:
Problem (1):
total_items ranges anywhere from orders of one sneaker to two-thousand sneakersSolution:
Problem (2):
order_amountsSolution:
looking the standard deviation of the AOV along with the AOV gives us a better estimate of the variation of the order values.
additionally, Median Order Value (MOV) could also be insightful when our data is skewed
def metrics_calculator(df):
"""
A simple function calculates the AOV, Standard Deviation of AOV and Median Order Value
Parameters
----------
Pandas Dataframe with column 'total_items' and 'order_amount'
Returns
-------
Pandas DataFrame
"""
#calculate the AOV
groupbyTotalItems = df.groupby('total_items')['order_amount'].mean().rename_axis('Total Items in Order').reset_index(name='AOV')
# next we calculate the SD for AOV
SD = df.groupby('total_items')['order_amount'].std().rename_axis('Total Items in Order').reset_index(name='std')['std']
# calculate the Median Order Value
median = df.groupby('total_items')['order_amount'].median().rename_axis('Total Items in Order').reset_index(name='Median')['Median']
#adding the SD and Median Order Value column to the dataframe
groupbyTotalItems['Median Order Value'] = median
groupbyTotalItems['Standard Deviation'] = SD
return groupbyTotalItems
metrics_calculator(data)
Grouping the orders based on total items in order seems to give us more reasonable AOVs. However, the standard deviation is still relatively high. There is massive variation in order amounts, even when controlled for total items in the order.
The Median Order Value (MOV) gives us additional information about the distribution of order_amounts. The MOV tells us that 50% of the orders amounts were above the median, and 50% were below. However, the median can be a tricky metric when viewed in isolation. An increase in the median order value does not indicate whether or not sales went overall. Median is just the middle value of a set of numbers. Rather than tunnel visioning on a single metric, viewing at the AOV, MOV, and the SD in combination leads to a richer understanding of the distribution of order_amounts. The gap between the MOV and AOV gives a rough estimate of the skew in the data distribution. Additionally, the SD gives information about how much the order values vary.
Ultimately, the improved understanding of data allows Merchants and the Shopify Team to make more informed decisions.
SELECT ShipperName, COUNT(Orders.ShipperID)
FROM Orders
INNER JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY Orders.ShipperID;
There were 54 orders shipped by Speedy Express.
SELECT LastName, COUNT(Orders.EmployeeID)
FROM Orders
INNER JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
GROUP BY Orders.EmployeeID
Order by COUNT(Orders.EmployeeID) Desc;
The last of of the employee with the most orders is "Peacock".
There can be two possible ways to interpret this question.
If we are interested in the quantity of the product
SELECT ProductName, MAX(Quantity)
FROM(
SELECT Products.ProductName as ProductName, OrderDetails.Quantity
FROM Orders
INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID
INNER JOIN OrderDetails ON Orders.OrderID=OrderDetails.OrderID
INNER JOIN Products ON Products.ProductID=OrderDetails.ProductID
WHERE Country = 'Germany'
Group By Products.ProductID
Order by OrderDetails.Quantity Desc);
Steeleye Stout was purchased 100 times, was the most ordered item in Germany
However, if we are not interested in the quantity of the product, but simply care about which product was ordered most frequently by customers, we can using the following SQL Query:
SELECT ProductName, MAX(ProductCount)
FROM (
SELECT Products.ProductName as ProductName, COUNT(Products.ProductName) as ProductCount
FROM Orders
INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID
INNER JOIN OrderDetails ON Orders.OrderID=OrderDetails.OrderID
INNER JOIN Products ON Products.ProductID=OrderDetails.ProductID
WHERE Country = 'Germany'
GROUP BY Products.ProductName
Order by COUNT(Products.ProductName) Desc
);
Gorgonzola Telino was ordered five times, making it the most ordered product in Germany.